The main aim is to predict breast cancer patients' chances of survival.
- Clean the data
- Augment the data
- Create some plots
- Run a statistical analysis
- Create the prediction tool
We are working with a breast cancer dataset obtained from Kaggle.
This is a summary of the dataset:
Numeric variables:

| Variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA's |
|---|---|---|---|---|---|---|---|
| birth_date | 1944 | 1978 | 1985 | 1982 | 1990 | 1999 | |
| age | 20.00 | 29.00 | 34.00 | 36.81 | 41.00 | 75.00 | |
| weight | 35.00 | 73.00 | 79.00 | 78.75 | 87.00 | 101.00 | |
| thickness_tumor | 0.0100 | 0.4000 | 0.6000 | 0.5757 | 0.8000 | 1.3000 | |
| treatment_age | 20.00 | 29.00 | 34.00 | 36.83 | 41.00 | 75.00 | 2 |

One of birth_date, age, or weight also has 2 missing values (NA's); the summary output does not make clear which.

Categorical variables:

| Variable | Counts per level |
|---|---|
| patient_id | 111035895969: 1, 111035896483: 1, 111035897677: 1, 111035897739: 1, 111035897959: 1, 111035898167: 1, (Other): 728 |
| education | Diploma: 253, Elementary: 113, Middle School: 97, Bachelor: 82, Illiterate: 79, High School: 55, (Other): 55 |
| id_healthcenter | 1110000154: 14, 1110000280: 11, 1110000303: 11, 1110000181: 10, 1110000305: 10, 1110000224: 9, (Other): 669 |
| id_treatment_region | 1110000329: 284, 1110000330: 261, 1110000331: 189 |
| hereditary_history | 0: 310, 1: 424 |
| marital_status | 0: 139, 1: 595 |
| marital_length | above 10 years: 409, under 10 years: 325 |
| pregnency_experience | 0: 144, 1: 590 |
| giving_birth | 1: 364, 0: 137, 2: 128, 3: 75, 4: 13, 5: 12, (Other): 5 |
| age_FirstGivingBirth | above 30: 428, under 30: 306 |
| abortion | 0: 594, 1: 140 |
| blood | A+: 176, A-: 124, AB+: 119, B+: 108, O+: 78, (Other): 128, NA's: 1 |
| taking_heartMedicine | 0: 281, 1: 453 |
| taking_blood_pressure_medicine | 0: 210, 1: 524 |
| taking_gallbladder_disease_medicine | 0: 343, 1: 391 |
| smoking | 0: 503, 1: 231 |
| alcohol | 0: 454, 1: 280 |
| breast_pain | 0: 280, 1: 454 |
| radiation_history | 0: 366, 1: 368 |
| Birth_control | 0: 261, 1: 473 |
| menstrual_age | above 12: 304, not yet: 3, under 12: 427 |
| menopausal_age | above 50: 36, not yet: 644, under 50: 52, NA's: 2 |
| Benign_malignant_cancer | Benign: 303, Malignant: 431 |
| condition | death: 351, recovered: 132, under treatment: 251 |
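The counts in the summary can be turned into shares of the whole dataset, which are often easier to read than raw counts. The following is a minimal sketch using only the `condition` and `Benign_malignant_cancer` counts copied from the summary; the helper name `shares` is our own and not part of the original analysis.

```python
# Counts copied from the dataset summary.
condition_counts = {"death": 351, "recovered": 132, "under treatment": 251}
benign_malignant_counts = {"Benign": 303, "Malignant": 431}


def shares(counts):
    """Return each level's share of the total as a fraction between 0 and 1."""
    total = sum(counts.values())
    return {level: n / total for level, n in counts.items()}


# Sanity check: both variables should describe the same number of patients.
assert sum(condition_counts.values()) == sum(benign_malignant_counts.values())

for level, share in shares(condition_counts).items():
    print(f"{level}: {share:.1%}")
```

The same check can be repeated for any other categorical variable to confirm that its level counts (including missing values) add up to the same total.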
| Before | After |
|---|---|
| Columns have mixed types | All columns are treated as doubles |
| 0, 1, 2 values | Boolean variables |
| Names containing `\r\n` | Clean names |
| Birth dates with 3 characters | Birth dates with 4 characters |
| Invalid blood type values (`44`) | Valid blood types only |
| Implausible weight/age combinations | Patients under 18 years old removed |
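Some of the cleaning steps in the table can be sketched in a few lines. This is only an illustration under stated assumptions, not the code used for the project: the helper names, the choice of which columns are Boolean, the handling of values other than 0 and 1, and the sample record are all hypothetical. The birth date fix is left out because the table does not say how the missing character was restored.

```python
# Assumption: the eight standard ABO/Rh blood types are the only valid values.
VALID_BLOOD_TYPES = {"A+", "A-", "B+", "B-", "AB+", "AB-", "O+", "O-"}
# Assumption: a few 0/1 columns from the summary, chosen for illustration.
BOOLEAN_COLUMNS = ("smoking", "alcohol", "radiation_history")


def clean_name(name):
    """Remove carriage returns and newlines, then surrounding whitespace."""
    return name.replace("\r", "").replace("\n", "").strip()


def to_bool(value):
    """Map 0/1 (as text or number) to False/True; anything else becomes None."""
    return {"0": False, "1": True}.get(str(value).strip())


def clean_row(row):
    """Return a cleaned copy of one record, or None if the record is dropped."""
    cleaned = {clean_name(key): value for key, value in row.items()}
    # Drop patients under 18; records with a missing age are kept here.
    age = cleaned.get("age")
    if age is not None and float(age) < 18:
        return None
    # Treat unrecognised blood types as missing.
    if "blood" in cleaned and cleaned["blood"] not in VALID_BLOOD_TYPES:
        cleaned["blood"] = None
    for column in BOOLEAN_COLUMNS:
        if column in cleaned:
            cleaned[column] = to_bool(cleaned[column])
    return cleaned


# Hypothetical raw record, not taken from the dataset.
raw = {"age\r\n": "34", "blood": "44", "smoking": "1", "alcohol": "0"}
print(clean_row(raw))
```

Returning `None` for dropped records keeps the function easy to use in a list comprehension that filters the cleaned rows.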
For the statistical analysis, we kept only women.
We created plots to explore the data and ran statistical analyses such as multiple correspondence analysis (MCA). The plots are shown in the next section, “Results”.
Preliminary observations:
- Variables that affect health (medication, unhealthy habits) have a strong influence in breast cancer patients.
- Early menstrual periods (before age 12) and starting menopause after age 55 expose women to hormones for longer, raising their risk of getting breast cancer.
- Patients with no radiation history recovered better than patients with a radiation history.
- In most cases, recovery is better among patients who have taken medicine.
- In most cases, deaths are also higher among patients who have taken medicine, which does not make sense.
- Not drinking alcohol and not smoking improve recovery.
- Deaths are lower among patients who drink alcohol and smoke, which does not make sense either.
- These are absolute values; we should probably calculate relative values as well.
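The point about absolute versus relative values can be illustrated with a small sketch. Absolute counts depend on group size, so a smaller group can show fewer deaths while having a higher death rate. The function name `row_proportions` is our own, and the counts below are placeholders chosen for illustration; they are not taken from the dataset.

```python
def row_proportions(table):
    """Convert counts {group: {outcome: n}} into within-group proportions."""
    result = {}
    for group, outcomes in table.items():
        total = sum(outcomes.values())
        result[group] = {
            outcome: (n / total if total else 0.0)
            for outcome, n in outcomes.items()
        }
    return result


# Placeholder counts for illustration only; NOT taken from the dataset.
example = {
    "smoker": {"death": 10, "recovered": 10},
    "non-smoker": {"death": 30, "recovered": 50},
}

for group, proportions in row_proportions(example).items():
    print(group, {outcome: f"{p:.1%}" for outcome, p in proportions.items()})
```

In this made-up table the smokers have fewer deaths in absolute terms (10 versus 30) but a higher share of deaths within their group (50% versus 37.5%), which is the kind of reversal that relative values would reveal.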
We have reached the following conclusions: